It has happened more times than I can count on one hand. A business or technology leader tells me they have a lot of good data they want to leverage. In the same breath, they tell me they have bad data! It does not take a data professional to see that both cannot be true if taken literally. But I get the underlying sentiment.
Invariably, I find it reflects a lack of clarity about the true state of their data. In every single one of these cases, the conclusions from data assessment efforts have surprised them. Not so much because the assessment results run counter to their expectations, but because their expectations are so vague.
Don’t get me wrong. This isn’t necessarily meant to be an indictment of these leaders for not understanding what they have in their data. Rather, it is further evidence that data needs to be approached in its own right.
As direct users of data, analytics practitioners—statisticians, data scientists, machine learning engineers, etc.—complain about data all the time. This is because they see data problems more often than anyone else. In many cases, they are the first ones to ever take a deep look at the data at hand.
But we have so much valuable data!
The fundamental challenges analytics practitioners face with data broadly fall into one of the following categories:
- Lack of proper documentation. It is not easy to find documentation about the data. The data dictionary does not exist, or it exists but is incomplete or not useful.
- Errors in the data discovered by users. Errors and inconsistencies abound when analysts take a deeper look. The data is wrong when they get it, and it falls on their shoulders to clean it. Or they end up declaring it unusable, at which point everyone else complains about their unwillingness to work with such an untapped goldmine. Meanwhile, the analysts no longer trust the data.
- Data contents that do not make sense. The data does not reflect what the business or researchers understand to be the reality. This is not an error in the technology sense: the data meets all the rules and the logic. But it does not make sense! I was talking to someone not too long ago whose analysis included patient weight as an important variable. The results did not make sense, and much head-scratching ensued. Later, they discovered the values were either in pounds or in kilograms, with no way to tell which for each patient. Logically speaking, every value met the rules: numeric, up to three digits, and so on. But 100 lb. (about 45.4 kg) is a very different patient weight from 100 kg (about 220 lb.). A small sketch of this failure mode follows this list.
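To make that failure mode concrete, here is a minimal sketch in Python. It is illustrative only: the sample values, the adult-weight plausibility range, and the unit check are my assumptions, not details from the actual case.

```python
# Minimal sketch of the pound/kilogram ambiguity (hypothetical values and ranges).
weights = [100, 72, 180, 65, 210]  # patient_weight: numeric, up to three digits

# The technical validation rules all pass: numeric, positive, up to three digits.
assert all(isinstance(w, (int, float)) and 0 < w < 1000 for w in weights)

# A semantic check exposes the problem: many values are plausible both as
# kilograms and as pounds, so the unit cannot be inferred from the value alone.
LB_PER_KG = 2.20462
for w in weights:
    plausible_as_kg = 30 <= w <= 200              # rough adult range, in kg
    plausible_as_lb = 30 <= w / LB_PER_KG <= 200  # same range if the value is in lb
    if plausible_as_kg and plausible_as_lb:
        print(f"{w}: ambiguous -- {w} kg, or {w} lb ({w / LB_PER_KG:.1f} kg)?")
```

Both checks can be run mechanically, but only the second reflects what the data is supposed to mean, and that meaning has to come from somewhere outside the system.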
These problems invariably trace to something much bigger than technology or analytics. And the scope of the issue is almost always well beyond a project, a tool, or even data itself.
But analytics people should take care of that, right?
Many analytics practitioners don't even realize that data management is a discipline in its own right. They are unaware there is an entire profession dedicated to it, with its own professional associations, body of knowledge, and so on, just as statistics and data science have theirs. They can't recommend or follow best practices they are not aware of.
Some analytics practitioners think they manage data and have the skill set to do so. In reality, though, they manage data because they have to, in order to do what they really need to do: make something out of data, not make data error-free.
Many of the data problems analytics practitioners face are assumed to be data quality issues when they are actually data management issues. For example, one clear sign of a data management problem in an organization is that users cannot find documentation about the data. In fact, data management encompasses much more than data quality:
- Metadata management: metadata is data about data; the data dictionary is a classic example (a sketch of one appears below).
- Interpretation: what the data actually means in its business or research context.
- Data architecture and data modeling: the former is different from technology architecture, and the latter is different from analytics model development.
- Integration: bringing data together consistently across systems and sources.
- Security and privacy (of course!).
- Use (finally!).
Direct participation by analytics practitioners is limited to only a small part of this one last bit: use. It’s a classic case of the tip of the iceberg. There is so much more to data management than making analytics out of data.
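As a concrete illustration of the first item above, here is what a minimal data dictionary entry for that patient-weight column might look like. This is a hypothetical sketch, not a prescribed standard; every field name is illustrative.

```python
# Hypothetical data dictionary entry for the patient-weight column from earlier.
# The fields and their names are illustrative, not a prescribed standard.
patient_weight_entry = {
    "name": "patient_weight",
    "type": "numeric",
    "unit": "kg",                    # exactly the metadata missing in the story above
    "valid_range": (1, 500),         # technical rule: what values are allowed
    "definition": "Patient body weight recorded at intake.",
    "source": "EHR intake form",     # lineage: where the value comes from
    "steward": "clinical data team", # who answers questions about this column
}
print(patient_weight_entry["unit"])  # "kg" -- no more guessing between lb and kg
```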
What is data management, then?
Data needs to be managed like any other asset. So, data management is asset management. Look at this definition:
“The development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycles.” (DAMA-DMBOK 2nd ed.)
This is not the professional goal of technology or analytics. Neither technology nor analytics manages the entire lifecycle of data or its value.
Data quality does get talked about a good bit. Of course, data security and privacy are hot topics, especially for compliance reasons, though often more from a technology angle. But other important aspects affect the value we get from data: storage, structure, integration, and even interpretation, use, and exploration. Unfortunately, the world is still largely stuck at simply making data available.
Then, all of these need to be supported by governance, the functional structure of the organization, data strategy, business processes, and finally, technology. This is a very different view of data management from the one held by most analytics practitioners I have come across.
Is it a technology thing or an analytics thing?
Technology custodians of data have told me many times that they have good, clean data: it meets all the rules and logic in the system. But not once have I found that to be true from the data user's perspective. This is not a knock on the technology people by any means. Rather, they are held responsible, purely by assumption, for something no one has clearly articulated, and in fact for something they do not really do. Their role does not include examining data contents in the way data contents need to be examined. Assigning this responsibility to technology by default is unfair. I've heard many technology leaders lament that they are blamed for bad data despite never having signed up for that responsibility.
On the other hand, analytics practitioners often reactively take on some of the responsibilities of a data function, simply because they are usually the first to find data problems. Worse, they are often the only ones who find any data problems at all. They are falsely expected to take on this responsibility simply because they are the closest to the data contents. Their skill set and user perspective, with which they have had limited success cleansing data, only feed this belief. At the risk of upsetting some people, and with some exceptions: data quality is, for analytics practitioners, a dependency and a constraint. Again, their job is to make something out of data, not to make data error-free.
Who should be responsible for data management?
A key, persistent challenge is that no one involved clearly articulates what "data management" or "data quality" means, whether within an organization, in a group of individual collaborators, or anywhere in between. As a consequence, we assume either that technology is responsible by custody or that analytics is responsible by proximity. Both are frustrated; so is everyone else.
So then, how do you define and assign that responsibility? That’s a whole discussion in itself!